{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Logistics Regression\n",
    "\n",
    "本文要讨论一个分类模型，Logistics Regression，中文名通常为逻辑回归，名字中虽然包含回归，但却实实在在是个分类模型。\n",
    "\n",
    "## 模型定义\n",
    "\n",
    "逻辑回归的模型可以很自然地从线性回归中引入，在二分类任务中，可以把样本分为正负两类，我们可以根据样本的特征来预测它为正例的概率。设样本为正的概率为 $P(x)$，可以得出以下模型：\n",
    "\n",
    "$$\n",
    "P(x) = w·x + b\n",
    "$$\n",
    "\n",
    "这里整个模型还是线性回归模型，只不过线性回归模型的目标是样本为正例的概率。有了这个模型后，对于所有正例样本，可以设为正例的概率为 1，否则为 0，然后使用线性模型来训练。\n",
    "\n",
    "事实证明这个模型也是可以工作的，但不够好，线性回归模型的输出可以是任意实数，但概率很显然只能是 0~1 之间的数。\n",
    "\n",
    "![](https://wangyu-name.oss-cn-hangzhou.aliyuncs.com/superbed/2019/10/02/5d948f68451253d17800c6b9.jpg)\n",
    "\n",
    "上图中的函数名叫 sigmoid 函数，它的特点是对任意输入，输出都在 0~1 之间。因此，可以将前面式子中线性回归的输出在传入 sigmoid 函数，得到最终的输出。\n",
    "\n",
    "sigmoid 函数的定义为：\n",
    "\n",
    "$$\n",
    "S(z)=\\frac{1}{1+e^{-z}}\n",
    "$$\n",
    "\n",
    "\n",
    "因此整个模型的定义如下：\n",
    "\n",
    "$$\n",
    "P(x)=\\frac{1}{1+e^{-(wx+b)}}\n",
    "$$\n",
    "\n",
    "上面就是逻辑回归的模型定义了。\n",
    "\n",
    "## 模型求解\n",
    "\n",
    "设正样本的 label 为 1，负样本的 label 为 0，则预测样本为 1 或 0 的概率为：\n",
    "\n",
    "$$\n",
    "P(y=1|x) = P(x)\n",
    "$$\n",
    "\n",
    "$$\n",
    "P(y=0|x) = 1-P(x)\n",
    "$$\n",
    "\n",
    "两者结合：\n",
    "\n",
    "$$\n",
    "P(y|x) = P(x)^y(1-P(x))^{1-y}\n",
    "$$\n",
    "\n",
    "对于所有样本 $(x_i, y_i)$ 可以写出似然函数：\n",
    "\n",
    "$$\n",
    "L(w,b) = \\prod\\limits_{i=1}^{n}(P(x^{(i)}))^{y^{(i)}}(1-P(x^{(i)}))^{1-y^{(i)}}\n",
    "$$\n",
    "\n",
    "记损失函数为：\n",
    "\n",
    "$$\n",
    "J(w, b) = -lnL(w, b)\n",
    "$$\n",
    "\n",
    "模型参数 $w$ 和 $b$ 的可以使用梯度下降法求解。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 再谈模型定义\n",
    "\n",
    "前面我将逻辑回归和线性回归差不多，只不过逻辑回归中拟合的是样本为正的概率。但并不能说明逻辑回归为啥叫逻辑回归。这里将揭示逻辑回归真正的由来。\n",
    "\n",
    "一件事情的几率（odds）是该事件发生与不发生的概率的比值，如果事件发生的概率为 $p$，那么事件的几率（odds）为：\n",
    "\n",
    "$$\n",
    "odds = \\frac{p}{1-p}\n",
    "$$\n",
    "\n",
    "几率（odds）的对数被称为对数几率，记为：\n",
    "\n",
    "$$\n",
    "logit(p) = log \\frac{p}{1-p}\n",
    "$$\n",
    "\n",
    "几率（odds）的值域为 $[0, \\infty]$，logit 的值域为 $[-\\infty, \\infty]$，当 logit 无穷小时，说明样本为正的概率为 0，当 logit 为无穷大时，说明样本为正的概率为 1。\n",
    "\n",
    "所有逻辑回归模型定义如下：\n",
    "\n",
    "$$\n",
    "log \\frac{p}{1-p} = w·x + b\n",
    "$$\n",
    "\n",
    "解出 p(x) 为:\n",
    "\n",
    "$$\n",
    "p(x) = \\frac{1}{1+e^{-(wx+b)}}\n",
    "$$\n",
    "\n",
    "逻辑回归，是对 logit 函数进行回归，因此成为 logistics regression。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 逻辑回归实践"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "生成样本\n",
    "\"\"\"\n",
    "\n",
    "from sklearn.datasets import make_classification\n",
    "from sklearn.linear_model import LinearRegression\n",
    "\n",
    "X, y = make_classification(n_samples=10000, n_features=10, n_informative=5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "划分训练集和测试集\n",
    "\"\"\"\n",
    "\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n",
       "          intercept_scaling=1, max_iter=100, multi_class='warn',\n",
       "          n_jobs=None, penalty='l2', random_state=None, solver='sag',\n",
       "          tol=0.0001, verbose=0, warm_start=False)"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "\"\"\"\n",
    "使用逻辑回归模型\n",
    "\"\"\"\n",
    "\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "lr = LogisticRegression(solver='sag')\n",
    "lr.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.8795"
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "\"\"\"\n",
    "在测试集上评估\n",
    "\"\"\"\n",
    "\n",
    "from sklearn.metrics import accuracy_score\n",
    "\n",
    "y_pred = lr.predict(X_test)\n",
    "\n",
    "accuracy_score(y_test, y_pred)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 推荐阅读\n",
    "\n",
    "- 李宏毅老师关于分类的课程 [Classification: Logistic Regression](http://speech.ee.ntu.edu.tw/~tlkagk/courses_ML16.html)\n",
    "- Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow 第三章和第四章"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
